#HELIX.RData Download: 
helix_file_path <- "~/Library/CloudStorage/GoogleDrive-bandov@usc.edu/.shortcut-targets-by-id/1oBvDKkpKxGnEoNogWDoXV--27W2spqKh/HELIX_data/HELIX.RData"
load(helix_file_path)
# Metabol_serum.RData Download:
metabol_serum_file_path <- "~/Library/CloudStorage/GoogleDrive-bandov@usc.edu/.shortcut-targets-by-id/1oBvDKkpKxGnEoNogWDoXV--27W2spqKh/HELIX_data/metabol_serum.RData"
load(metabol_serum_file_path)
#Adjusting BMI Category 
#Binning BMI so that predictive power increases 
phenotype <- phenotype %>%
  mutate(hs_bmi_c_cat = ifelse(hs_bmi_c_cat %in% c(1, 2), 0, 
                               ifelse(hs_bmi_c_cat %in% c(3, 4), 1, hs_bmi_c_cat)))

Abstract

This study examines the impact of prenatal environmental exposures on childhood body mass index (BMI) in children aged 6 to 11 years. Using data from the HELIX study, we investigate the influence of prenatal dietary intake and phthalate exposure concentrations on BMI, while controlling for child age at examinations and gestational age at birth. Lasso and Group Lasso regression models to assess the predictive power of these prenatal exposures.

[PUT THE FINAL RESULTS HERE]

Hypothesis

How do prenatal dietary intake and concentrations of prenatal phthalate exposures influence body mass index (BMI) for children aged 6 to 11 years, while controlling for child age, sex, and cohort? Additionally, can urine metabolomics data improve predictive models of BMI using statistical and machine learning tools?

Introduction

HELIX: Building the Early-Life Exposome
HELIX: Building the Early-Life Exposome

The relationship between prenatal environmental exposures and childhood health outcomes is important, especially in understanding how prenatal factors influence long-term health in children. This data project aims to explore the influence of prenatal dietary intake and phthalate exposure concentrations on body mass index (BMI) in children aged 6 to 11 years. Phthalates, commonly found in plastics and personal care products, are known endocrine disruptors that may impact fetal development and childhood growth patterns (Valvi et al., 2020). Previous studies have shown that prenatal exposure to phthalates is associated with an increased risk of obesity and metabolic disorders in children (Luo et al., 2020). By controlling for key variables such as child age at examinations and gestational age at birth, this project seeks to examine the direct and interactive effects of these prenatal exposures on BMI.

Recent studies emphasize the important role of maternal diet during pregnancy, demonstrating that balanced maternal nutrition can mitigate the adverse effects of environmental exposures like phthalates on child BMI, suggesting that improved dietary practices during pregnancy can lead to better health outcomes for children (NCBI, 2024).

Phthalate exposure has been linked to various health issues, including obesity, type II diabetes, thyroid dysfunction, higher blood pressure, precocious puberty, and reproductive effects. It also impacts the respiratory system (allergy, asthma) and nervous system (delayed neurodevelopment, social impairment) (Serrano et al., 2021).

Furthermore, this project looks at whether urine metabolomics data can improve predictive models of BMI using statistical and machine learning approaches. Integrating metabolomics data helps to understand the biochemical pathways linking prenatal exposures (diet and phthalates) to childhood health outcomes. This project aims to provide insights into the prevention and management of childhood obesity and related health conditions.

Methods

Data Preparation

Imputation

Missing values were imputed with the mean for relevant columns.

Data Splitting

To minimize the risk of overfitting, the dataset was split using a 10-fold cross-validation (CV=10) approach.

Variables

Outcome Variable

  • Body Mass Index (BMI) categorized (hs_bmi_c_cat)

Covariates

  • h_cohort
  • e3_sex_None
  • hs_child_age_None

Diet Variables (12 Diet)

  • e3_alcpreg_yn_None
  • h_cereal_preg_Ter
  • h_dairy_preg_Ter
  • h_fastfood_preg_Ter
  • h_folic_t1_None
  • h_fruit_preg_Ter
  • h_meat_preg_Ter
  • h_pamod_t3_None
  • h_fish_preg_Ter
  • h_legume_preg_Ter
  • h_pavig_t3_None
  • h_veg_preg_Ter

Chemical Variables (10 Chemicals)

  • hs_mbzp_madj_Log2
  • hs_mecpp_madj_Log2
  • hs_mep_madj_Log2
  • hs_mibp_madj_Log2
  • hs_mnbp_madj_Log2
  • hs_oxominp_madj_Log2
  • hs_mehhp_madj_Log2
  • hs_meohp_madj_Log2
  • hs_ohminp_madj_Log2
  • hs_mehp_madj_Log2

Meta Variables

Additional biochemical markers from urine metabolomics data. Click link below for more information: https://www.projecthelix.eu/files/HELIX_urine_metabol_DataSummary.pdf

Models

Model 1: Lasso, Ridge, and Elastic Net on Full Model (excluding urine metabolomics)

Evaluates three regularization regression techniques on the full dataset to identify the best model.

Model 2A: Outcome ~ Covariates

Uses the most optimized model from Model 1 on covariates only. (Lasso Regression)

Model 2B: Outcome ~ Covariates + Diet

Uses the most optimized model from Model 1 on covariates and diet variables. (Lasso Regression)

Model 2C: Outcome ~ Covariates + Diet + Chemicals

Examines the effect of covariates, diet, and chemical exposures on BMI. (Group Lasso Regression)

Model 2D: Outcome ~ Covariates + Diet + Chemicals + Meta

Expands Model C by including meta variables from urine metabolomics data. (Group Lasso Regression)

Model 3A Random Forest

Applied Random Forest (RF)to the best performing dataset with metabolomics data.

Model 3B: GBM

Applied to Gradient Boosting Machine (GBM) to the best performing dataset with metabolomics data.


Statistical Analysis

Primary Models

Two primary models were constructed using Lasso and Group Lasso regression techniques:

  • Lasso Regression: Used to identify the most relevant predictors by applying L1 regularization.
  • Group Lasso Regression: Applied to evaluate the combined effect of grouped variables (covariates, diet, chemicals, and meta variables).

Secondary Model

  • Random Forest: Used to determine the best machine learning algorithm for the best performing dataset.
  • Grafient Boosting Machine (GBM): Used to determine the bet machine learning algorthim for the best performing data set.

Model Evaluation

Performance Metrics

  • Area Under the Curve (AUC) of the ROC curve was used to assess the model’s predictive accuracy.
  • ROC Curve Comparison: ROC curves were plotted for both models to compare their performance visually.

Data

The complexity of exposure to environmental contaminants has increased becuase evolving environmental and lifestyle factors. The exposome includes all environmental (non-genetic) exposures an individual encounters from conception through old age. The HELIX (Human Early-Life Exposome) project focuses on the early-life exposome, which integrates all environmental hazards that mothers and children are exposed to and linking these exposures to health, growth, and development risks.

Pregnancy and early childhood are critical times when children are more vulnerable to environmental damage, which can have lifelong effects. Understanding the exposome during these periods can help prevent diseases, since early interventions can change biological foundation and promote healthy development. HELIX cna help show how different environmental exposures together affect health outcomes and risks.

There are six existing European birth cohort studies: Born in Bradford (BiB), Etude des Déterminants pré et postnatals du développement et de la santé de l’Enfant (EDEN), INfancia y Medio Ambiente (INMA), Kaunas Cohort (KANC), Norwegian Mother, Father and Child Cohort Study (MoBa), and Rhea Mother-Child Cohort Study. These cohorts have collected extensive data from national and EU-funded projects. HELIX supplements this data with advanced tools and methods to measure and integrate the chemical, physical, and molecular environment, linking these measurements to child health outcomes.

Smartphones are utilized to measure air pollution, UV radiation, physical activity, and noise exposure. Advanced laboratory techniques identify biological markers of various chemical exposures, such as contaminants in food, consumer products, and water. HELIX has gathered extensive exposome data from mothers and children, making it the largest study on this topic. The study design is multilevel: the first level includes 31,472 mother-child pairs recruited during pregnancy across the six cohorts; the second level consists of a subcohort of 1301 mother-child pairs with detailed measurements of biomarkers, omics signatures, and health outcomes at ages 6-11 years; and the third level involves repeat-sampling panel studies with about 150 children and 150 pregnant women to collect personal exposure data.

This research focuses on a subcohort of 1301 mother-child pairs to explore questions related to environmental exposures, omic data, and their impact on health outcomes. Specifically, the project will examine urine metabolomics data with Body Mass Index (BMI) as the primary outcome of interest. The goal of this project is to provide a deeper understanding of how early-life environmental exposures influence BMI, potentially providing more information on how to apply early intervention and disease prevention.

For more details on the study design see Vrijheid, Slama, et al. EHP 2014. see https://www.projecthelix.eu/index.php/es/data-inventory for more information regarding the study.

Codebook

codebook_file_path <- "~/Library/CloudStorage/GoogleDrive-bandov@usc.edu/.shortcut-targets-by-id/1oBvDKkpKxGnEoNogWDoXV--27W2spqKh/HELIX_data/HELIX.RData"
load(codebook_file_path)
#  Chemicals
filtered_codebook_chemicals <- codebook %>%
  filter(domain == "Chemicals" & 
         family == "Phthalates" & 
         period == "Pregnancy" & 
         variable_name != "hs_sumDEHP_madj_Log2")

# Covariates 
filtered_codebook_covariates <- codebook %>%
  filter(domain == "Covariates" & 
         variable_name %in% c("e3_sex_None", "h_cohort", "hs_child_age_None"))

# Phenotype
filtered_codebook_phenotype <- codebook %>%
  filter(domain == "Phenotype" & 
         variable_name %in% c("hs_bmi_c_cat"))

# Lifestyle
filtered_codebook_lifestyles <- codebook %>%
  filter(domain == "Lifestyles" & period == "Pregnancy" & subfamily == "Diet")

# Combining all the information 
combined_codebook <- bind_rows(
  filtered_codebook_chemicals,
  filtered_codebook_covariates,
  filtered_codebook_phenotype,
  filtered_codebook_lifestyles
)

Data Summary Exposures: Lifestyles (Diet)

#Lifestyle 
filtered_codebook_lifestyles <- codebook %>%
  filter(domain == "Lifestyles" & period == "Pregnancy")
selectExposures <- filtered_codebook_lifestyles$variable_name
summarytools::view(dfSummary(exposome[,names(exposome) %in% selectExposures], 
                             style = 'grid', 
                             plain.ascii = FALSE, 
                             valid.col = FALSE, 
                             headings = FALSE), 
                   method = "render")
No Variable Stats / Values Freqs (% of Valid) Graph Missing
1 e3_alcpreg_yn_None [factor]
1. 0
2. 1
896(68.9%)
405(31.1%)
0 (0.0%)
2 h_cereal_preg_Ter [factor]
1. (0,9]
2. (9,27.3]
3. (27.3,Inf]
531(40.8%)
459(35.3%)
311(23.9%)
0 (0.0%)
3 h_dairy_preg_Ter [factor]
1. (0,17.1]
2. (17.1,27.1]
3. (27.1,Inf]
270(20.8%)
380(29.2%)
651(50.0%)
0 (0.0%)
4 h_fastfood_preg_Ter [factor]
1. (0,0.25]
2. (0.25,0.83]
3. (0.83,Inf]
94(7.2%)
535(41.1%)
672(51.7%)
0 (0.0%)
5 h_fish_preg_Ter [factor]
1. (0,1.9]
2. (1.9,4.1]
3. (4.1,Inf]
343(26.4%)
490(37.7%)
468(36.0%)
0 (0.0%)
6 h_folic_t1_None [factor]
1. 0
2. 1
606(46.6%)
695(53.4%)
0 (0.0%)
7 h_fruit_preg_Ter [factor]
1. (0,0.6]
2. (0.6,18.2]
3. (18.2,Inf]
6(0.5%)
922(70.9%)
373(28.7%)
0 (0.0%)
8 h_legume_preg_Ter [factor]
1. (0,0.5]
2. (0.5,2]
3. (2,Inf]
245(18.8%)
269(20.7%)
787(60.5%)
0 (0.0%)
9 h_meat_preg_Ter [factor]
1. (0,6.5]
2. (6.5,10]
3. (10,Inf]
427(32.8%)
387(29.7%)
487(37.4%)
0 (0.0%)
10 h_pamod_t3_None [factor]
1. None
2. Often
3. Sometimes
4. Very Often
42(3.2%)
474(36.4%)
191(14.7%)
594(45.7%)
0 (0.0%)
11 h_pavig_t3_None [factor]
1. High
2. Low
3. Medium
47(3.6%)
952(73.2%)
302(23.2%)
0 (0.0%)
12 h_veg_preg_Ter [factor]
1. (0,8.8]
2. (8.8,16.5]
3. (16.5,Inf]
539(41.4%)
470(36.1%)
292(22.4%)
0 (0.0%)

Generated by summarytools 1.0.1 (R version 4.4.1)
2024-07-30

Data Summary Exposures: Chemicals (Phtlates)

#Chemical
filtered_codebook_chemicals <- codebook %>%
  filter(domain == "Chemicals" & 
         family == "Phthalates" & 
         period == "Pregnancy" & 
         variable_name != "hs_sumDEHP_madj_Log2")
selectExposures <- filtered_codebook_chemicals$variable_name
summarytools::view(dfSummary(exposome[,names(exposome) %in% selectExposures], 
                             style = 'grid', 
                             plain.ascii = FALSE, 
                             valid.col = FALSE, 
                             headings = FALSE), 
                   method = "render")
No Variable Stats / Values Freqs (% of Valid) Graph Missing
1 hs_mbzp_madj_Log2 [numeric]
Mean (sd) : 3 (1.6)
min ≤ med ≤ max:
-3.7 ≤ 2.9 ≤ 9.3
IQR (CV) : 2.2 (0.5)
873 distinct values 0 (0.0%)
2 hs_mecpp_madj_Log2 [numeric]
Mean (sd) : 5 (1)
min ≤ med ≤ max:
2.4 ≤ 4.9 ≤ 10.4
IQR (CV) : 1.3 (0.2)
734 distinct values 0 (0.0%)
3 hs_mehhp_madj_Log2 [numeric]
Mean (sd) : 4.2 (1.2)
min ≤ med ≤ max:
-0.5 ≤ 4.1 ≤ 9.9
IQR (CV) : 1.3 (0.3)
890 distinct values 0 (0.0%)
4 hs_mehp_madj_Log2 [numeric]
Mean (sd) : 2.9 (1.4)
min ≤ med ≤ max:
-7.5 ≤ 3.1 ≤ 8.7
IQR (CV) : 2 (0.5)
873 distinct values 0 (0.0%)
5 hs_meohp_madj_Log2 [numeric]
Mean (sd) : 3.8 (1.2)
min ≤ med ≤ max:
0 ≤ 3.7 ≤ 9.6
IQR (CV) : 1.3 (0.3)
894 distinct values 0 (0.0%)
6 hs_mep_madj_Log2 [numeric]
Mean (sd) : 7.8 (1.8)
min ≤ med ≤ max:
3.3 ≤ 7.8 ≤ 14.1
IQR (CV) : 2.5 (0.2)
874 distinct values 0 (0.0%)
7 hs_mibp_madj_Log2 [numeric]
Mean (sd) : 5.3 (1.1)
min ≤ med ≤ max:
0.9 ≤ 5.3 ≤ 9.5
IQR (CV) : 1.3 (0.2)
871 distinct values 0 (0.0%)
8 hs_mnbp_madj_Log2 [numeric]
Mean (sd) : 5 (1.2)
min ≤ med ≤ max:
-0.7 ≤ 4.9 ≤ 12.7
IQR (CV) : 1.4 (0.2)
897 distinct values 0 (0.0%)
9 hs_ohminp_madj_Log2 [numeric]
Mean (sd) : -0.3 (1.5)
min ≤ med ≤ max:
-11.5 ≤ -0.2 ≤ 6.1
IQR (CV) : 1 (-5.1)
730 distinct values 0 (0.0%)
10 hs_oxominp_madj_Log2 [numeric]
Mean (sd) : -0.1 (1.6)
min ≤ med ≤ max:
-11.6 ≤ 0 ≤ 5.6
IQR (CV) : 1.2 (-28.7)
766 distinct values 0 (0.0%)

Generated by summarytools 1.0.1 (R version 4.4.1)
2024-07-30

Data Summary Exposures: Covariate

# Covariates 
filtered_codebook_covariates <- codebook %>%
  filter(variable_name %in% c("e3_sex_None", "h_cohort", "hs_child_age_None"))
selectCovariates <- filtered_codebook_covariates$variable_name
summarytools::view(dfSummary(covariates[,names(covariates) %in% selectCovariates], 
                             style = 'grid', 
                             plain.ascii = FALSE, 
                             valid.col = FALSE, 
                             headings = FALSE), 
                   method = "render")
No Variable Stats / Values Freqs (% of Valid) Graph Missing
1 h_cohort [factor]
1. 1
2. 2
3. 3
4. 4
5. 5
6. 6
202(15.5%)
198(15.2%)
224(17.2%)
207(15.9%)
272(20.9%)
198(15.2%)
0 (0.0%)
2 e3_sex_None [factor]
1. female
2. male
608(46.7%)
693(53.3%)
0 (0.0%)
3 hs_child_age_None [numeric]
Mean (sd) : 8 (1.6)
min ≤ med ≤ max:
5.4 ≤ 8 ≤ 12.1
IQR (CV) : 2.4 (0.2)
879 distinct values 0 (0.0%)

Generated by summarytools 1.0.1 (R version 4.4.1)
2024-07-30

Data Exploration

# Combining covariates with phenotype (Full Data)
# Lifestyle variables
filtered_codebook_lifestyles <- codebook %>%
  filter(domain == "Lifestyles" & period == "Pregnancy")
selectExposures_lifestyle <- filtered_codebook_lifestyles$variable_name

# Chemical variables
filtered_codebook_chemicals <- codebook %>%
  filter(domain == "Chemicals" & family == "Phthalates" & period == "Pregnancy" & variable_name != "hs_sumDEHP_madj_Log2")
selectExposures_chemicals <- filtered_codebook_chemicals$variable_name

# Covariate variables
filtered_codebook_covariates <- codebook %>%
  filter(variable_name %in% c("e3_sex_None", "h_cohort", "hs_child_age_None"))
selectCovariates <- filtered_codebook_covariates$variable_name

# Phenotype variables
filtered_codebook_phenotype <- codebook %>%
  filter(domain == "Phenotype" & variable_name %in% c("hs_bmi_c_cat"))
selectPhenotypes <- filtered_codebook_phenotype$variable_name
all_selected_variables <- c("ID", selectExposures_lifestyle, selectExposures_chemicals, selectCovariates, selectPhenotypes, "age")

# Subset the data
subset_exposome <- exposome %>% dplyr::select(all_of(selectExposures_lifestyle), all_of(selectExposures_chemicals))
subset_covariates <- covariates %>% dplyr::select(all_of(selectCovariates))
subset_phenotype <- phenotype %>% dplyr::select(all_of(selectPhenotypes))

# Final Merge
exposome_phenotype_covariates <- exposome %>%
  dplyr::select(ID, all_of(selectExposures_lifestyle), all_of(selectExposures_chemicals)) %>%
  left_join(covariates %>% dplyr::select(ID, all_of(selectCovariates)), by = "ID") %>%
  left_join(phenotype %>% dplyr::select(ID, all_of(selectPhenotypes)), by = "ID")

# Binning BMI
exposome_phenotype_covariates <- exposome_phenotype_covariates %>%
  mutate(hs_bmi_c_cat = ifelse(hs_bmi_c_cat %in% c(1, 2), 0, 
                               ifelse(hs_bmi_c_cat %in% c(3, 4), 1, hs_bmi_c_cat)))
#Creating Full and Important Data
#Initial Model (Model 1)

full_variables <- c("e3_alcpreg_yn_None", "h_cereal_preg_Ter", "h_dairy_preg_Ter", "h_fastfood_preg_Ter",
                    "h_fish_preg_Ter", "h_folic_t1_None", "h_fruit_preg_Ter", "h_legume_preg_Ter",
                    "h_meat_preg_Ter", "h_pamod_t3_None", "h_pavig_t3_None", "h_veg_preg_Ter",
                    "hs_mbzp_madj_Log2", "hs_mecpp_madj_Log2", "hs_mehhp_madj_Log2", "hs_mehp_madj_Log2",
                    "hs_meohp_madj_Log2", "hs_mep_madj_Log2", "hs_mibp_madj_Log2", "hs_mnbp_madj_Log2",
                    "hs_ohminp_madj_Log2", "hs_oxominp_madj_Log2", "e3_sex_None", "h_cohort",
                    "hs_child_age_None", "hs_bmi_c_cat")


# Variable extracted based on Model 1.

important_vars <- c("e3_alcpreg_yn_None", "h_cereal_preg_Ter", "h_dairy_preg_Ter", 
                    "h_fastfood_preg_Ter", "h_folic_t1_None", "h_fruit_preg_Ter", "h_meat_preg_Ter", 
                    "h_pamod_t3_None", "hs_mbzp_madj_Log2", "hs_mecpp_madj_Log2", "hs_mep_madj_Log2", 
                    "hs_mibp_madj_Log2", "hs_mnbp_madj_Log2", "hs_oxominp_madj_Log2", "h_cohort",
                    "e3_sex_None", "hs_child_age_None", "hs_bmi_c_cat")


full_dataset <- exposome_phenotype_covariates[full_variables]
important_dataset <- exposome_phenotype_covariates[important_vars]


full_dataset$ID <- seq_len(nrow(full_dataset))
important_dataset$ID <- seq_len(nrow(important_dataset))
#Merged Data + Meta:
transposed_data <- t(metabol_serum.d)
transposed_data <- as.data.frame(transposed_data)
transposed_data$ID <- rownames(transposed_data)
transposed_data <- transposed_data[, c(ncol(transposed_data), 1:(ncol(transposed_data)-1))]
rownames(transposed_data) <- NULL


full_data_with_meta <- merge(full_dataset,transposed_data , by = "ID", all = TRUE)
var_imp_data_with_meta <- merge(important_dataset, transposed_data, by = "ID", all = TRUE)

Table by Cohort:

exposome_phenotype_covariates <- exposome_phenotype_covariates %>%
  mutate(
    e3_sex_None = factor(e3_sex_None),
    h_cohort = factor(h_cohort),
    hs_bmi_c_cat = factor(hs_bmi_c_cat)
  )

# Labels
label(exposome_phenotype_covariates$e3_sex_None) <- "Sex"
label(exposome_phenotype_covariates$h_cohort) <- "Cohort"
label(exposome_phenotype_covariates$hs_child_age_None) <- "Child's Age"
label(exposome_phenotype_covariates$e3_alcpreg_yn_None) <- "Alcohol During Pregnancy"
label(exposome_phenotype_covariates$h_cereal_preg_Ter) <- "Cereal Intake During Pregnancy"
label(exposome_phenotype_covariates$h_dairy_preg_Ter) <- "Dairy Intake During Pregnancy"
label(exposome_phenotype_covariates$h_fastfood_preg_Ter) <- "Fast Food Intake During Pregnancy"
label(exposome_phenotype_covariates$h_fish_preg_Ter) <- "Fish Intake During Pregnancy"
label(exposome_phenotype_covariates$h_folic_t1_None) <- "Folic Acid Intake"
label(exposome_phenotype_covariates$h_fruit_preg_Ter) <- "Fruit Intake During Pregnancy"
label(exposome_phenotype_covariates$h_legume_preg_Ter) <- "Legume Intake During Pregnancy"
label(exposome_phenotype_covariates$h_meat_preg_Ter) <- "Meat Intake During Pregnancy"
label(exposome_phenotype_covariates$h_pamod_t3_None) <- "Physical Activity (Moderate)"
label(exposome_phenotype_covariates$h_pavig_t3_None) <- "Physical Activity (Vigorous)"
label(exposome_phenotype_covariates$h_veg_preg_Ter) <- "Vegetable Intake During Pregnancy"
label(exposome_phenotype_covariates$hs_mbzp_madj_Log2) <- "MBzP (Log2)"
label(exposome_phenotype_covariates$hs_mecpp_madj_Log2) <- "MECPP (Log2)"
label(exposome_phenotype_covariates$hs_mehhp_madj_Log2) <- "MEHHP (Log2)"
label(exposome_phenotype_covariates$hs_mehp_madj_Log2) <- "MEHP (Log2)"
label(exposome_phenotype_covariates$hs_meohp_madj_Log2) <- "MEOHP (Log2)"
label(exposome_phenotype_covariates$hs_mep_madj_Log2) <- "MEP (Log2)"
label(exposome_phenotype_covariates$hs_mibp_madj_Log2) <- "MiBP (Log2)"
label(exposome_phenotype_covariates$hs_mnbp_madj_Log2) <- "MnBP (Log2)"
label(exposome_phenotype_covariates$hs_ohminp_madj_Log2) <- "OH-MiNP (Log2)"
label(exposome_phenotype_covariates$hs_oxominp_madj_Log2) <- "OXO-MiNP (Log2)"
label(exposome_phenotype_covariates$hs_bmi_c_cat) <- "BMI Category"

# Continuous variables
render_cont <- function(x) {
  sprintf("%.2f (%.2f)", mean(x, na.rm = TRUE), sd(x, na.rm = TRUE))
}

# Categorical variables
render_cat <- function(x) {
  paste0(names(table(x)), " (", table(x), ")", collapse = ", ")
}

# Make sure it's 27 columns
columns <- c("age", "hs_child_age_None", "e3_alcpreg_yn_None", "h_cereal_preg_Ter",
             "h_dairy_preg_Ter", "h_fastfood_preg_Ter", "h_fish_preg_Ter", "h_folic_t1_None",
             "h_fruit_preg_Ter", "h_legume_preg_Ter", "h_meat_preg_Ter", "h_pamod_t3_None",
             "h_pavig_t3_None", "h_veg_preg_Ter", "hs_mbzp_madj_Log2", "hs_mecpp_madj_Log2",
             "hs_mehhp_madj_Log2", "hs_mehp_madj_Log2", "hs_meohp_madj_Log2", "hs_mep_madj_Log2",
             "hs_mibp_madj_Log2", "hs_mnbp_madj_Log2", "hs_ohminp_madj_Log2", "hs_oxominp_madj_Log2", 
             "hs_bmi_c_cat", "e3_sex_None", "h_cohort")

# Stratified by cohort
table1(~  hs_child_age_None  + e3_alcpreg_yn_None + h_cereal_preg_Ter + h_dairy_preg_Ter +
         h_fastfood_preg_Ter + h_fish_preg_Ter + h_folic_t1_None + h_fruit_preg_Ter +
         h_legume_preg_Ter + h_meat_preg_Ter + h_pamod_t3_None + h_pavig_t3_None +
         h_veg_preg_Ter + hs_mbzp_madj_Log2 + hs_mecpp_madj_Log2 + hs_mehhp_madj_Log2 +
         hs_mehp_madj_Log2 + hs_meohp_madj_Log2 + hs_mep_madj_Log2 + hs_mibp_madj_Log2 +
         hs_mnbp_madj_Log2 + hs_ohminp_madj_Log2 + hs_oxominp_madj_Log2 | h_cohort,
       data = exposome_phenotype_covariates,
       render.continuous = render_cont, render.categorical = render_cat,
       overall = TRUE, topclass = "Rtable1-shade")
1
(N=202)
2
(N=198)
3
(N=224)
4
(N=207)
5
(N=272)
6
(N=198)
TRUE
(N=1301)
Child's Age 6.61 (0.28) 10.82 (0.58) 8.78 (0.58) 6.48 (0.47) 8.46 (0.53) 6.51 (0.30) 7.98 (1.61)
Alcohol During Pregnancy 0 (118), 1 (84) 0 (128), 1 (70) 0 (186), 1 (38) 0 (186), 1 (21) 0 (144), 1 (128) 0 (134), 1 (64) 0 (896), 1 (405)
Cereal Intake During Pregnancy (0,9] (131), (9,27.3] (64), (27.3,Inf] (7) (0,9] (4), (9,27.3] (114), (27.3,Inf] (80) (0,9] (200), (9,27.3] (24), (27.3,Inf] (0) (0,9] (119), (9,27.3] (87), (27.3,Inf] (1) (0,9] (10), (9,27.3] (64), (27.3,Inf] (198) (0,9] (67), (9,27.3] (106), (27.3,Inf] (25) (0,9] (531), (9,27.3] (459), (27.3,Inf] (311)
Dairy Intake During Pregnancy (0,17.1] (0), (17.1,27.1] (49), (27.1,Inf] (153) (0,17.1] (30), (17.1,27.1] (66), (27.1,Inf] (102) (0,17.1] (101), (17.1,27.1] (85), (27.1,Inf] (38) (0,17.1] (0), (17.1,27.1] (43), (27.1,Inf] (164) (0,17.1] (74), (17.1,27.1] (72), (27.1,Inf] (126) (0,17.1] (65), (17.1,27.1] (65), (27.1,Inf] (68) (0,17.1] (270), (17.1,27.1] (380), (27.1,Inf] (651)
Fast Food Intake During Pregnancy (0,0.25] (0), (0.25,0.83] (81), (0.83,Inf] (121) (0,0.25] (5), (0.25,0.83] (25), (0.83,Inf] (168) (0,0.25] (0), (0.25,0.83] (80), (0.83,Inf] (144) (0,0.25] (1), (0.25,0.83] (87), (0.83,Inf] (119) (0,0.25] (88), (0.25,0.83] (170), (0.83,Inf] (14) (0,0.25] (0), (0.25,0.83] (92), (0.83,Inf] (106) (0,0.25] (94), (0.25,0.83] (535), (0.83,Inf] (672)
Fish Intake During Pregnancy (0,1.9] (19), (1.9,4.1] (105), (4.1,Inf] (78) (0,1.9] (107), (1.9,4.1] (62), (4.1,Inf] (29) (0,1.9] (14), (1.9,4.1] (81), (4.1,Inf] (129) (0,1.9] (44), (1.9,4.1] (82), (4.1,Inf] (81) (0,1.9] (45), (1.9,4.1] (98), (4.1,Inf] (129) (0,1.9] (114), (1.9,4.1] (62), (4.1,Inf] (22) (0,1.9] (343), (1.9,4.1] (490), (4.1,Inf] (468)
Folic Acid Intake 0 (90), 1 (112) 0 (182), 1 (16) 0 (41), 1 (183) 0 (107), 1 (100) 0 (171), 1 (101) 0 (15), 1 (183) 0 (606), 1 (695)
Fruit Intake During Pregnancy (0,0.6] (0), (0.6,18.2] (166), (18.2,Inf] (36) (0,0.6] (0), (0.6,18.2] (147), (18.2,Inf] (51) (0,0.6] (0), (0.6,18.2] (115), (18.2,Inf] (109) (0,0.6] (0), (0.6,18.2] (173), (18.2,Inf] (34) (0,0.6] (0), (0.6,18.2] (190), (18.2,Inf] (82) (0,0.6] (6), (0.6,18.2] (131), (18.2,Inf] (61) (0,0.6] (6), (0.6,18.2] (922), (18.2,Inf] (373)
Legume Intake During Pregnancy (0,0.5] (0), (0.5,2] (0), (2,Inf] (202) (0,0.5] (1), (0.5,2] (20), (2,Inf] (177) (0,0.5] (8), (0.5,2] (138), (2,Inf] (78) (0,0.5] (0), (0.5,2] (1), (2,Inf] (206) (0,0.5] (226), (0.5,2] (34), (2,Inf] (12) (0,0.5] (10), (0.5,2] (76), (2,Inf] (112) (0,0.5] (245), (0.5,2] (269), (2,Inf] (787)
Meat Intake During Pregnancy (0,6.5] (54), (6.5,10] (46), (10,Inf] (102) (0,6.5] (91), (6.5,10] (67), (10,Inf] (40) (0,6.5] (47), (6.5,10] (117), (10,Inf] (60) (0,6.5] (73), (6.5,10] (35), (10,Inf] (99) (0,6.5] (65), (6.5,10] (66), (10,Inf] (141) (0,6.5] (97), (6.5,10] (56), (10,Inf] (45) (0,6.5] (427), (6.5,10] (387), (10,Inf] (487)
Physical Activity (Moderate) None (5), Often (69), Sometimes (41), Very Often (87) None (3), Often (80), Sometimes (28), Very Often (87) None (11), Often (74), Sometimes (42), Very Often (97) None (7), Often (75), Sometimes (20), Very Often (105) None (6), Often (108), Sometimes (36), Very Often (122) None (10), Often (68), Sometimes (24), Very Often (96) None (42), Often (474), Sometimes (191), Very Often (594)
Physical Activity (Vigorous) High (10), Low (151), Medium (41) High (11), Low (137), Medium (50) High (7), Low (172), Medium (45) High (5), Low (161), Medium (41) High (9), Low (183), Medium (80) High (5), Low (148), Medium (45) High (47), Low (952), Medium (302)
Vegetable Intake During Pregnancy (0,8.8] (124), (8.8,16.5] (78), (16.5,Inf] (0) (0,8.8] (106), (8.8,16.5] (71), (16.5,Inf] (21) (0,8.8] (26), (8.8,16.5] (100), (16.5,Inf] (98) (0,8.8] (113), (8.8,16.5] (93), (16.5,Inf] (1) (0,8.8] (129), (8.8,16.5] (92), (16.5,Inf] (51) (0,8.8] (41), (8.8,16.5] (36), (16.5,Inf] (121) (0,8.8] (539), (8.8,16.5] (470), (16.5,Inf] (292)
MBzP (Log2) 1.90 (1.29) 4.55 (1.38) 3.25 (1.52) 3.13 (1.44) 2.65 (1.23) 2.48 (1.50) 2.98 (1.60)
MECPP (Log2) 4.88 (0.93) 5.52 (1.08) 4.83 (0.78) 4.64 (0.72) 5.17 (0.95) 5.11 (1.13) 5.03 (0.98)
MEHHP (Log2) 3.38 (1.08) 4.97 (1.19) 4.43 (1.24) 3.63 (0.82) 4.23 (1.11) 4.26 (1.37) 4.16 (1.25)
MEHP (Log2) 2.01 (1.27) 3.41 (1.25) 3.10 (1.53) 2.31 (1.21) 3.73 (1.05) 2.82 (1.44) 2.94 (1.43)
MEOHP (Log2) 3.19 (1.09) 4.48 (1.15) 4.18 (1.20) 3.35 (0.91) 3.84 (1.12) 3.59 (1.35) 3.78 (1.22)
MEP (Log2) 7.81 (1.78) 7.13 (1.67) 8.56 (1.59) 8.60 (1.12) 7.24 (2.03) 7.34 (1.85) 7.77 (1.82)
MiBP (Log2) 5.26 (0.97) 5.84 (1.04) 4.91 (1.10) 5.45 (0.67) 5.13 (1.12) 5.39 (1.07) 5.31 (1.05)
MnBP (Log2) 4.56 (1.08) 5.70 (1.14) 4.82 (1.29) 5.11 (1.01) 4.99 (1.06) 4.57 (1.33) 4.96 (1.21)
OH-MiNP (Log2) -0.53 (1.30) -0.64 (1.46) -0.20 (0.68) -0.46 (0.91) 0.43 (1.45) -0.67 (2.48) -0.30 (1.53)
OXO-MiNP (Log2) 0.14 (1.17) -0.45 (1.64) -0.07 (0.95) -0.28 (0.68) 0.41 (1.86) -0.25 (2.39) -0.06 (1.59)

Table by Sex:

# Stratified by sex
table1(~  hs_child_age_None  + e3_alcpreg_yn_None + h_cereal_preg_Ter + h_dairy_preg_Ter +
         h_fastfood_preg_Ter + h_fish_preg_Ter + h_folic_t1_None + h_fruit_preg_Ter +
         h_legume_preg_Ter + h_meat_preg_Ter + h_pamod_t3_None + h_pavig_t3_None +
         h_veg_preg_Ter + hs_mbzp_madj_Log2 + hs_mecpp_madj_Log2 + hs_mehhp_madj_Log2 +
         hs_mehp_madj_Log2 + hs_meohp_madj_Log2 + hs_mep_madj_Log2 + hs_mibp_madj_Log2 +
         hs_mnbp_madj_Log2 + hs_ohminp_madj_Log2 + hs_oxominp_madj_Log2 | e3_sex_None,
       data = exposome_phenotype_covariates,
       render.continuous = render_cont, render.categorical = render_cat,
       overall = TRUE, topclass = "Rtable1-shade")
female
(N=608)
male
(N=693)
TRUE
(N=1301)
Child's Age 7.91 (1.58) 8.03 (1.64) 7.98 (1.61)
Alcohol During Pregnancy 0 (437), 1 (171) 0 (459), 1 (234) 0 (896), 1 (405)
Cereal Intake During Pregnancy (0,9] (246), (9,27.3] (213), (27.3,Inf] (149) (0,9] (285), (9,27.3] (246), (27.3,Inf] (162) (0,9] (531), (9,27.3] (459), (27.3,Inf] (311)
Dairy Intake During Pregnancy (0,17.1] (124), (17.1,27.1] (171), (27.1,Inf] (313) (0,17.1] (146), (17.1,27.1] (209), (27.1,Inf] (338) (0,17.1] (270), (17.1,27.1] (380), (27.1,Inf] (651)
Fast Food Intake During Pregnancy (0,0.25] (50), (0.25,0.83] (248), (0.83,Inf] (310) (0,0.25] (44), (0.25,0.83] (287), (0.83,Inf] (362) (0,0.25] (94), (0.25,0.83] (535), (0.83,Inf] (672)
Fish Intake During Pregnancy (0,1.9] (145), (1.9,4.1] (238), (4.1,Inf] (225) (0,1.9] (198), (1.9,4.1] (252), (4.1,Inf] (243) (0,1.9] (343), (1.9,4.1] (490), (4.1,Inf] (468)
Folic Acid Intake 0 (279), 1 (329) 0 (327), 1 (366) 0 (606), 1 (695)
Fruit Intake During Pregnancy (0,0.6] (3), (0.6,18.2] (425), (18.2,Inf] (180) (0,0.6] (3), (0.6,18.2] (497), (18.2,Inf] (193) (0,0.6] (6), (0.6,18.2] (922), (18.2,Inf] (373)
Legume Intake During Pregnancy (0,0.5] (116), (0.5,2] (124), (2,Inf] (368) (0,0.5] (129), (0.5,2] (145), (2,Inf] (419) (0,0.5] (245), (0.5,2] (269), (2,Inf] (787)
Meat Intake During Pregnancy (0,6.5] (187), (6.5,10] (186), (10,Inf] (235) (0,6.5] (240), (6.5,10] (201), (10,Inf] (252) (0,6.5] (427), (6.5,10] (387), (10,Inf] (487)
Physical Activity (Moderate) None (17), Often (213), Sometimes (95), Very Often (283) None (25), Often (261), Sometimes (96), Very Often (311) None (42), Often (474), Sometimes (191), Very Often (594)
Physical Activity (Vigorous) High (22), Low (449), Medium (137) High (25), Low (503), Medium (165) High (47), Low (952), Medium (302)
Vegetable Intake During Pregnancy (0,8.8] (258), (8.8,16.5] (214), (16.5,Inf] (136) (0,8.8] (281), (8.8,16.5] (256), (16.5,Inf] (156) (0,8.8] (539), (8.8,16.5] (470), (16.5,Inf] (292)
MBzP (Log2) 2.91 (1.54) 3.04 (1.64) 2.98 (1.60)
MECPP (Log2) 5.01 (1.01) 5.05 (0.95) 5.03 (0.98)
MEHHP (Log2) 4.15 (1.25) 4.16 (1.25) 4.16 (1.25)
MEHP (Log2) 2.96 (1.38) 2.92 (1.47) 2.94 (1.43)
MEOHP (Log2) 3.79 (1.22) 3.78 (1.22) 3.78 (1.22)
MEP (Log2) 7.81 (1.87) 7.74 (1.77) 7.77 (1.82)
MiBP (Log2) 5.30 (1.02) 5.32 (1.08) 5.31 (1.05)
MnBP (Log2) 4.95 (1.22) 4.97 (1.20) 4.96 (1.21)
OH-MiNP (Log2) -0.29 (1.63) -0.31 (1.43) -0.30 (1.53)
OXO-MiNP (Log2) 0.08 (1.45) -0.18 (1.69) -0.06 (1.59)

Data Summary Exposures of Outcome: Phenotype (BMI Category)

Since the original data contained 4 BMI categories, the distribution of those categories were not normal or representative of the population. Aggregation of the BMI variables were performed to ensure that the model is more reliable.

After re-categorization, BMI category is moderately imbalanced. Given this moderate imbalance, we have decided to proceed with the model without making any adjustments.

summary_data <- full_dataset %>%
  count(hs_bmi_c_cat) %>%
  mutate(percentage = n / sum(n) * 100)

ggplot(data = summary_data, aes(x = as.factor(hs_bmi_c_cat), y = percentage)) +
  geom_bar(stat = "identity", fill = c("lightblue", "lightgreen")) +
  geom_text(aes(label = paste0(round(percentage, 1), "%")), 
            vjust = -0.5, size = 4) +
  labs(title = "Distribution of BMI Categories",
       x = "BMI Cateogories",
       y = "Percentage") +
  scale_y_continuous(labels = scales::percent_format(scale = 1)) +
  theme_minimal() +
  theme(axis.text.x = element_text(angle = 0, hjust = 0.5))

Correlation Matrix

Below is a correlation matrix to help decide which model is most appropriate for the data.

numeric_vars <- exposome_phenotype_covariates %>%
  dplyr::select(where(is.numeric))

cor_matrix <- cor(numeric_vars, use = "complete.obs")
corrplot(cor_matrix, method = "color", type = "upper", tl.col = "black", tl.srt = 60, tl.cex = 0.8)

find_highly_correlated <- function(cor_matrix, threshold = 0.8) {
  cor_matrix[lower.tri(cor_matrix, diag = TRUE)] <- NA  
  cor_matrix <- as.data.frame(as.table(cor_matrix)) 
  cor_matrix <- na.omit(cor_matrix)  
  cor_matrix <- cor_matrix[order(-abs(cor_matrix$Freq)), ]  
  cor_matrix <- cor_matrix %>% filter(abs(Freq) > threshold)  
  return(cor_matrix)
}

highly_correlated_pairs <- find_highly_correlated(cor_matrix, threshold = 0.60)

The heatmap shows significant multicollinearity among several phthalate variables. Traditional regression models, such as lasso regression, struggle with such correlations, leading to unreliable estimates. Grouped lasso/ridge/e-net logistic regression addresses this by selecting correlated variables This method ensures that highly correlated variables are handled together. The best performing model will used for future model building.

Model Building

Model 1: Used Grouped LASSO, RIDGE, ENET Regression on HELIX Data (not including Metabolmics)

Model 1 applies advanced regression techniques—Grouped LASSO, Grouped Ridge, and Grouped Elastic Net—to the HELIX dataset, which does not include metabolomics data. These methods address the issue of multicollinearity among highly correlated variables, such as phthalates, by selecting or regularizing groups of correlated predictors together. This approach ensures more reliable estimates and reduces overfitting. The best performing model among Grouped LASSO, Grouped Ridge, and Grouped Elastic Net will be selected for future model building.

# Grouped Variables
outcome <- "hs_bmi_c_cat"

covariates_final <- c("h_cohort", "e3_sex_None", "hs_child_age_None") 

phthalates_final <- c("hs_mbzp_madj_Log2", "hs_mecpp_madj_Log2", "hs_mep_madj_Log2", 
                      "hs_mibp_madj_Log2", "hs_mnbp_madj_Log2", "hs_oxominp_madj_Log2", 
                      "hs_mehhp_madj_Log2", "hs_meohp_madj_Log2", "hs_ohminp_madj_Log2", 
                      "hs_mehp_madj_Log2")

diet_final <- c("e3_alcpreg_yn_None", "h_cereal_preg_Ter", "h_dairy_preg_Ter", 
                "h_fastfood_preg_Ter", "h_folic_t1_None", "h_fruit_preg_Ter", 
                "h_meat_preg_Ter", "h_pamod_t3_None", "h_fish_preg_Ter", 
                "h_legume_preg_Ter", "h_pavig_t3_None", "h_veg_preg_Ter")

predictors_final <- c(covariates_final, phthalates_final, diet_final)

# Data Prep
x <- model.matrix(~ . - 1, data = full_dataset[, predictors_final])
y <- as.numeric(as.character(full_dataset[[outcome]]))

# groups
group_ids <- c(rep(1, length(covariates_final)), 
               rep(2, length(phthalates_final)), 
               rep(3, length(diet_final)))


column_names <- colnames(x)
adjusted_groups <- rep(NA, length(column_names))

for (i in seq_along(predictors_final)) {
  predictor <- predictors_final[i]
  group_id <- group_ids[i]
  matched_cols <- grep(paste0("^", predictor), column_names)
  adjusted_groups[matched_cols] <- group_id
}

# Penalization factors (Keep covariates in the model)
penalty_factor <- ifelse(adjusted_groups == 1, 0, 1)

# train-test split (80/20) (TRIED CV-10 folds, but ran into problems)
set.seed(2024)  
train_indices <- createDataPartition(y, p = 0.8, list = FALSE)
x_train <- x[train_indices, ]
y_train <- y[train_indices]
x_test <- x[-train_indices, ]
y_test <- y[-train_indices]

###################
####LASSO##########
###################

# Train Grouped Lasso model
grouped_lasso <- cv.grpreg(x_train, y_train, group = adjusted_groups, family = "binomial", penalty = "grLasso", penalty.factor = penalty_factor, lambda.min.ratio = 1e-5)

# Predict on test set for Lasso
predictions_lasso <- predict(grouped_lasso, x_test, type = "response")

# AUC for Lasso
roc_curve_lasso <- roc(y_test, predictions_lasso)
auc_lasso <- auc(roc_curve_lasso)

# Accuracy for Lasso
predicted_classes_lasso <- ifelse(predictions_lasso > 0.5, 1, 0)
accuracy_lasso <- mean(predicted_classes_lasso == y_test)

###################
###RIDGE###########
###################
# Train Grouped Ridge model
grouped_ridge <- cv.grpreg(x_train, y_train, group = adjusted_groups, family = "binomial", penalty = "gel", penalty.factor = penalty_factor, lambda.min.ratio = 1e-5)

# Predict on test set for Ridge
predictions_ridge <- predict(grouped_ridge, x_test, type = "response")

# AUC for Ridge
roc_curve_ridge <- roc(y_test, predictions_ridge)
auc_ridge <- auc(roc_curve_ridge)

# Accuracy for Ridge
predicted_classes_ridge <- ifelse(predictions_ridge > 0.5, 1, 0)
accuracy_ridge <- mean(predicted_classes_ridge == y_test)

##################
###ENET###########
##################

# Train Elastic Net model
cv_glmnet_model <- cv.glmnet(x_train, y_train, family = "binomial", alpha = 0.5, lambda = 10^seq(-4, 4, length.out = 100), penalty.factor = penalty_factor)

# Predict on test set for Elastic Net
predictions_enet <- predict(cv_glmnet_model, s = "lambda.min", newx = x_test, type = "response")

# AUC for Elastic Net
roc_curve_enet <- roc(y_test, predictions_enet)
auc_enet <- auc(roc_curve_enet)

# Accuracy for Elastic Net
predicted_classes_enet <- ifelse(predictions_enet > 0.5, 1, 0)
accuracy_enet <- mean(predicted_classes_enet == y_test)

#print(paste("Lasso - AUC: ", auc_lasso))
#print(paste("Ridge - AUC: ", auc_ridge))
#print(paste("Elastic Net - AUC: ", auc_enet))

roc_data_lasso <- data.frame(
  specificity = 1 - roc_curve_lasso$specificities,
  sensitivity = roc_curve_lasso$sensitivities,
  Model = "Lasso"
)

roc_data_ridge <- data.frame(
  specificity = 1 - roc_curve_ridge$specificities,
  sensitivity = roc_curve_ridge$sensitivities,
  Model = "Ridge"
)

roc_data_enet <- data.frame(
  specificity = 1 - roc_curve_enet$specificities,
  sensitivity = roc_curve_enet$sensitivities,
  Model = "Elastic Net"
)

roc_data_combined <- rbind(roc_data_lasso, roc_data_ridge, roc_data_enet)

# ROC curve visualization
ggplot(roc_data_combined, aes(x = specificity, y = sensitivity, color = Model)) +
  geom_line(size = 1) +
  geom_abline(linetype = "dashed", color = "red") +
  labs(title = "ROC Curves for Lasso, Ridge, and Elastic Net",
       x = "1 - Specificity",
       y = "Sensitivity") +
  theme_minimal()

# Train Grouped Lasso model
grouped_lasso <- cv.grpreg(x_train, y_train, group = adjusted_groups, family = "binomial", penalty = "grLasso", penalty.factor = penalty_factor, lambda.min.ratio = 1e-5)

# Important variables
coef_lasso <- coef(grouped_lasso, s = "lambda.min")

# Non-zero coefficients and remove intercept
important_vars <- coef_lasso[coef_lasso != 0]
important_vars <- important_vars[names(important_vars) != "(Intercept)"]

# Important variables
important_vars_df <- data.frame(
  Variable = names(important_vars),
  Coefficient = as.numeric(important_vars)
)

In this analysis, three different regression models (Grouped Lasso, Group Ridge, and Group Elastic Net) were applied to the HELIX dataset, excluding metabolomics data. Covariates were forced into the models.

The models were trained using 10-fold cross-validation. The Lasso model achieved an Area Under the Curve (AUC) of 0.588. The Ridge model yielded an AUC of 0.584. The Elastic Net model resulted in an AUC of 0.585. Overall, the Lasso model performed slightly better in terms of AUC compared to Ridge and Elastic Net.

Due to its performance, both Grouped Lasso and regular Lasso will be considered for Model 2A, Model 2B, Model 2C, and Model 2D in future analyses.

# Train Grouped Lasso model
grouped_lasso <- cv.grpreg(x_train, y_train, group = adjusted_groups, family = "binomial", 
                           penalty = "grLasso", penalty.factor = penalty_factor, 
                           lambda.min.ratio = 1e-5)


coef_lasso <- coef(grouped_lasso, s = "lambda.min")

# Remove intercpt and filter non-zero coefficients
important_vars <- coef_lasso[coef_lasso != 0]
important_vars <- important_vars[names(important_vars) != "(Intercept)"]


important_vars_df <- data.frame(
  Variable = names(important_vars),
  Coefficient = as.numeric(important_vars)
)



ggplot(important_vars_df, aes(x = reorder(Variable, Coefficient), y = Coefficient)) +
  geom_bar(stat = "identity", fill = "steelblue") +
  coord_flip() +
  labs(title = "Important Variables from Grouped Lasso Regression",
       x = "Variables",
       y = "Coefficient") +
  theme_minimal()

The bar plot above visualizes the coefficients of important variables identified by the Grouped Lasso Regression model applied to the HELIX dataset (excluding metabolomics data). This analysis provides insights into which covariates, phthalates, and dietary factors have significant impacts on the outcome variable. The plot shows:

Positive Associations:

h_cohort3, h_cohort6, h_cohort4: These variables have the strongest positive coefficients with the outcome varaible hs_bmi_category h_pamod_t3_NoneVery Often, h_pamod_t3_NoneSometimes, h_fastfood_preg_Ter[0.83,1.79], h_meat_preg_Ter[6.5,10.4]: These dietary variables also show positive associations.

Negative Associations: h_fruit_preg_Ter[0.6,1.82], h_fruit_preg_Ter[2.3,Inf], h_veg_preg_Ter[8.6,15.6]: These dietary variables exhibit the strongest negative coefficients with the outcome variable hs_bmi_c_cat. h_cereal_preg_Ter[7.8,21.2], h_cereal_preg_Ter[2.3,Inf], h_veg_preg_Ter[16.5,Inf]: Other dietary variables that have negative associations.

Moderate Associations: e3_sex_Nonefemale, h_folic_t1_None, hs_child_age_None: These variables show moderate positive coefficients. h_legume_preg_Ter[0.5,2], h_meat_preg_Ter[10,Inf], h_pavig_t3_NoneLow: Variables with moderate negative associations.

Model 2A, Model 2B, Model 2C, and Model 2D: Regression Analysis for BMI Category (including Metabolomics)

The objective of Model 2A, 2B, 2C and 2D is to explore the effect of adding groups of variables and the impact of these variables on the outcome through variable importance.

full_data_with_meta_imputed <- full_data_with_meta %>%
  mutate(across(everything(), ~ ifelse(is.na(.), mean(., na.rm = TRUE), .)))

full_dataset$hs_bmi_c_cat <- as.factor(full_dataset$hs_bmi_c_cat)
full_dataset$e3_sex_None <- as.factor(full_dataset$e3_sex_None)
full_data_with_meta_imputed$hs_bmi_c_cat <- as.factor(full_data_with_meta_imputed$hs_bmi_c_cat)
full_data_with_meta_imputed$e3_sex_None <- as.factor(full_data_with_meta_imputed$e3_sex_None)

# Formula for the models
formula1 <- hs_bmi_c_cat ~ h_cohort + e3_sex_None + hs_child_age_None
formula2 <- hs_bmi_c_cat ~ h_cohort + e3_sex_None + hs_child_age_None + 
            e3_alcpreg_yn_None + h_cereal_preg_Ter + h_dairy_preg_Ter + 
            h_fastfood_preg_Ter + h_folic_t1_None + h_fruit_preg_Ter + 
            h_meat_preg_Ter + h_pamod_t3_None

# cross-validation setup
train_control <- trainControl(method = "cv", number = 10, classProbs = TRUE, summaryFunction = twoClassSummary)

# Train and test the models using LASSO with CV

# Model A
x1 <- model.matrix(formula1, data = full_dataset)[, -1]
y <- full_dataset$hs_bmi_c_cat
cv.lasso1 <- cv.glmnet(x1, y, alpha = 1, family = "binomial", nfolds = 10)
best_lambda1 <- cv.lasso1$lambda.min

# Model B
x2 <- model.matrix(formula2, data = full_dataset)[, -1]
cv.lasso2 <- cv.glmnet(x2, y, alpha = 1, family = "binomial", nfolds = 10)
best_lambda2 <- cv.lasso2$lambda.min

# Evaluate Model A
predictions1 <- predict(cv.lasso1, s = best_lambda1, newx = x1, type = "response")
roc1 <- roc(y, as.numeric(predictions1))
## Setting levels: control = 0, case = 1
## Setting direction: controls < cases
auc1 <- auc(roc1)
predicted_classes1 <- ifelse(predictions1 > 0.5, 1, 0)
accuracy1 <- mean(predicted_classes1 == y)

# Evaluate Model B
predictions2 <- predict(cv.lasso2, s = best_lambda2, newx = x2, type = "response")
roc2 <- roc(y, as.numeric(predictions2))
## Setting levels: control = 0, case = 1
## Setting direction: controls < cases
auc2 <- auc(roc2)
predicted_classes2 <- ifelse(predictions2 > 0.5, 1, 0)
accuracy2 <- mean(predicted_classes2 == y)

# ROC curves data for ggplot2
roc_data1 <- data.frame(
  specificity = 1 - roc1$specificities,
  sensitivity = roc1$sensitivities,
  Model = "Model 2A: Covariates"
)

roc_data2 <- data.frame(
  specificity = 1 - roc2$specificities,
  sensitivity = roc2$sensitivities,
  Model = "Model 2B: Covariates + Diet"
)

# Imputation
full_data_with_meta_imputed <- full_data_with_meta_imputed %>%
  mutate(across(everything(), ~ ifelse(is.na(.), mean(., na.rm = TRUE), .)))

# Convert outcome variable
y_imputed <- as.numeric(as.factor(full_data_with_meta_imputed$hs_bmi_c_cat)) - 1

# Metabolites variables
metabolites_final <- grep("^meta", names(full_data_with_meta_imputed), value = TRUE)

# Model C: Covariates + Diet + Phthalates
predictors_model3 <- c(covariates_final, phthalates_final, diet_final)
X_model3 <- model.matrix(~ . - 1, data = full_data_with_meta_imputed[, predictors_model3])
group_structure_model3 <- c(rep(1, length(covariates_final)),
                            rep(2, length(phthalates_final)),
                            rep(3, length(diet_final)))

# Penalization factors (0 for covariates to force them in the model)
penalty_factor_model3 <- ifelse(group_structure_model3 == 1, 0, 1)

# Fit group lasso model using grpreg for Model C
cv_fit_model3 <- cv.grpreg(X_model3, y_imputed, group = group_structure_model3, penalty = "grLasso", penalty.factor = penalty_factor_model3, nfolds = 10)
best_lambda_model3 <- cv_fit_model3$lambda.min
best_fit_model3 <- grpreg(X_model3, y_imputed, group = group_structure_model3, penalty = "grLasso", penalty.factor = penalty_factor_model3, lambda = best_lambda_model3)
predictions_model3 <- predict(best_fit_model3, X_model3, type = "response")
roc_curve_model3 <- roc(y_imputed, as.numeric(predictions_model3))
## Setting levels: control = 0, case = 1
## Setting direction: controls < cases
auc_value_model3 <- auc(roc_curve_model3)

# Accuracy for Model C
predicted_classes_model3 <- ifelse(predictions_model3 > 0.5, 1, 0)
accuracy_model3 <- mean(predicted_classes_model3 == y_imputed)

# Model D: Covariates + Diet + Phthalates + Metabolites
predictors_model4 <- c(covariates_final, phthalates_final, diet_final, metabolites_final)
X_model4 <- model.matrix(~ . - 1, data = full_data_with_meta_imputed[, predictors_model4])
group_structure_model4 <- c(rep(1, length(covariates_final)),
                            rep(2, length(phthalates_final)),
                            rep(3, length(diet_final)),
                            rep(4, length(metabolites_final)))

# Penalization factors (0 for covariates to force them in the model)
penalty_factor_model4 <- ifelse(group_structure_model4 == 1, 0, 1)

# Fit group lasso model using grpreg for Model D
cv_fit_model4 <- cv.grpreg(X_model4, y_imputed, group = group_structure_model4, penalty = "grLasso", penalty.factor = penalty_factor_model4, nfolds = 10)
best_lambda_model4 <- cv_fit_model4$lambda.min
best_fit_model4 <- grpreg(X_model4, y_imputed, group = group_structure_model4, penalty = "grLasso", penalty.factor = penalty_factor_model4, lambda = best_lambda_model4)
predictions_model4 <- predict(best_fit_model4, X_model4, type = "response")
roc_curve_model4 <- roc(y_imputed, as.numeric(predictions_model4))
## Setting levels: control = 0, case = 1
## Setting direction: controls < cases
auc_value_model4 <- auc(roc_curve_model4)

# Accuracy for Model D
predicted_classes_model4 <- ifelse(predictions_model4 > 0.5, 1, 0)
accuracy_model4 <- mean(predicted_classes_model4 == y_imputed)

# ROC curves data for ggplot2
roc_data_model3 <- data.frame(
  specificity = 1 - roc_curve_model3$specificities,
  sensitivity = roc_curve_model3$sensitivities,
  Model = "Model 2C: Covariates + Diet + Chemicals"
)

roc_data_model4 <- data.frame(
  specificity = 1 - roc_curve_model4$specificities,
  sensitivity = roc_curve_model4$sensitivities,
  Model = "Model 2D: Covariates + Diet + Chemicals + Metabolites"
)

# Combine all ROC data
roc_data_combined <- rbind(roc_data1, roc_data2, roc_data_model3, roc_data_model4)

# ROC curves with ggplot2
ggplot(roc_data_combined, aes(x = specificity, y = sensitivity, color = Model)) +
  geom_line(size = 1) +
  geom_abline(linetype = "dashed", color = "grey") +
  labs(title = "ROC Curve Comparison for All Models",
       x = "1 - Specificity",
       y = "Sensitivity") +
  theme_minimal() +
  theme(legend.position = "right") +
  scale_color_manual(values = c(
    "Model 2A: Covariates" = "lightgreen", 
    "Model 2B: Covariates + Diet" = "lightpink",
    "Model 2C: Covariates + Diet + Chemicals" = "lavender", 
    "Model 2D: Covariates + Diet + Chemicals + Metabolites" = "lightblue"
  ))

#print(paste("AUC for Model 2A (Covariates):", auc1))
#print(paste("AUC for Model 2B (Covariates + Diet):", auc2))
#print(paste("AUC for Model 2C (Covariates + Diet + Chemicals):", auc_value_model3))
#print(paste("AUC for Model 2D (Covariates + Diet + Chemicals + Metabolites):", auc_value_model4))

Top 10 Variable Importance

get_top_var_importance <- function(var_importance, model_name) {
  var_importance %>%
    filter(Coefficient != 0) %>%
    top_n(10, abs(Coefficient)) %>%
    mutate(Model = model_name)
}

# grab variable importance from a model
extract_var_importance <- function(model, lambda) {
  coef_model <- as.matrix(coef(model, s = lambda))
  var_importance <- data.frame(
    Variable = rownames(coef_model),
    Coefficient = coef_model[, 1]
  )
  var_importance <- var_importance %>%
    filter(Variable != "(Intercept)")
  return(var_importance)
}


var_importance1 <- extract_var_importance(cv.lasso1, cv.lasso1$lambda.min)
var_importance2 <- extract_var_importance(cv.lasso2, cv.lasso2$lambda.min)
var_importance3 <- extract_var_importance(best_fit_model3, cv_fit_model3$lambda.min)
var_importance4 <- extract_var_importance(best_fit_model4, cv_fit_model4$lambda.min)

# 10 variable importance for each model
top_var_importance1 <- get_top_var_importance(var_importance1, "Model A")
top_var_importance2 <- get_top_var_importance(var_importance2, "Model B")
top_var_importance3 <- get_top_var_importance(var_importance3, "Model C")
top_var_importance4 <- get_top_var_importance(var_importance4, "Model D")


top_var_importance_combined <- bind_rows(top_var_importance1, top_var_importance2, top_var_importance3, top_var_importance4)

ggplot(top_var_importance_combined, aes(x = reorder(Variable, abs(Coefficient)), y = abs(Coefficient), fill = Model)) +
  geom_bar(stat = "identity") +
  coord_flip() +
  labs(title = "Top 10 Variable Importance Across All Models",
       x = "Variable",
       y = "Absolute Coefficient") +
  theme_minimal() +
  theme(legend.position = "bottom",
        axis.text.x = element_text(angle = 45, hjust = 1),
        plot.title = element_text(size = 15, face = "bold"),
        strip.text = element_text(size = 10, face = "bold")) +
  scale_fill_manual(values = c(
    "Model A" = "lightgreen", 
    "Model B" = "lightpink",
    "Model C" = "lavender", 
    "Model D" = "lightblue"
  )) +
  facet_wrap(~ Model, scales = "free_y")

The models were trained using 10-fold cross-validation. Model 2A, which includes only covariates, achieved an AUC of 0.624. Model 2B, which adds diet variables to the covariates, improved the AUC to 0.647. Model 2C, which includes covariates, diet variables, and chemical exposures, achieved an AUC of 0.646. The best performance was observed in Model 2D, which includes covariates, diet variables, chemical exposures, and metabolomics data, with an AUC of 0.925.

The ROC curve comparison and variable importance plotsshows the performance and key predictors for each model. The inclusion of metabolomics data in Model 2D significantly enhanced the model’s predictive ability. These findings highlight the importance of considering comprehensive sets of variables, including metabolomics data, to improve the accuracy of predictive models for BMI categories. The analysis also demonstrates the value of examining variable importance to identify key predictors and gain insights into the most influential factors affecting BMI.

The objective of this analysis is to explore the effect of adding groups of variables, including metabolomics data, on the predictive performance of BMI categories and to assess the impact of these variables on the outcome through variable importance.

The models were trained using 10-fold cross-validation. Model 2A, which includes only covariates, achieved an AUC of 0.624. Model 2B, which adds diet variables to the covariates, improved the AUC to 0.647. Model 2C, which includes covariates, diet variables, and chemical exposures, achieved an AUC of 0.646. The best performance was observed in Model 2D, which includes covariates, diet variables, chemical exposures, and metabolomics data, with an AUC of 0.925.

The ROC curve comparison and variable importance plots shows the performance and important predictors for each model.

Comparison and Contrast of Top 10 Variable Importance Across Models:

Model 2A: Covariates (Grouped Lasso and Lasso) * Key Predictors: The most influential variables are related to the cohort categories (h_cohort3, h_cohort6, h_cohort4, h_cohort5, h_cohort2), sex (e3_sex_Nonemale), and child age (hs_child_age_None).

  • Importance: Cohort variables dominate the top 10 list, indicating that demographic characteristics and the specific cohort play a significant role in predicting BMI categories when only covariates are considered.

Model 2B: Covariates + Diet (Grouped Lasso and Lasso) * Inportant predictors: The addition of diet variables introduces h_cereal_preg_Ter, h_meat_preg_Ter, h_fruit_preg_Ter, and h_pamod_t3_NoneVery Often as important predictors, alongside the cohort variables (h_cohort3, h_cohort6, h_cohort5). * Importance: Diet variables such as cereal and meat consumption during pregnancy become significant, suggesting that dietary factors and demographic characteristics, strengthens the model’s predictive power.

Model 2C: Covariates + Diet + Chemicals (Grouped Lasso and Lasso) * Important Predictors: This model includes chemical exposures, with h_meccp_madj_Log2 and hs_oxominp_madj_Log2 appearing in the top 10 alongside the cohort and diet variables (h_cereal_preg_Ter, h_fruit_preg_Ter, h_fastfood_preg_Ter). * Importance: The inclusion of chemical exposures shows that both dietary factors and chemical exposures are important in predicting BMI categories, highlighting the complex interplay between different types of variables.

Model 2D: Covariates + Diet + Chemicals + Metabolites (Grouped Lasso and Lasso) * Important Predictors: Metabolomics data dominates the top 10 variables, with metabolites such as metab_161, metab_160, metab_125, metab_138, and metab_58 being the most important predictors. * Importance: The introduction of metabolomics data drastically changes the variable importance landscape, with metabolites becoming the primary predictors. This indicates the significant predictive power of detailed biochemical markers from urine metabolomics data in determining BMI categories.

Comparison and Contrast Summary * Cohort Variables: Consistently important across all models, especially in Models 2A and 2B. * Diet Variables: Gain importance in Models 2B and 2C, indicating the added value of dietary information. * Chemical Exposures: Introduced in Model 2C, showing moderate importance alongside diet and cohort variables. * Metabolomics Data: Dominates Model 2D, significantly improving the model’s AUC.

Since Model 2D performs the best in terms of AUC values, Model 2D will be applied to other algorthims an evlauated by Random Forest and Gradient BOosting Althogrithim.

Model 3A: Random Forest Model

full_data_with_meta_imputed$hs_bmi_c_cat <- as.factor(full_data_with_meta_imputed$hs_bmi_c_cat)
full_data_with_meta_imputed$e3_sex_None <- as.factor(full_data_with_meta_imputed$e3_sex_None)


covariates_final <- c('h_cohort', 'e3_sex_None', 'hs_child_age_None')
phthalates_final <- c('hs_mbzp_madj_Log2', 'hs_mecpp_madj_Log2', 'hs_mep_madj_Log2', 'hs_mibp_madj_Log2', 'hs_mnbp_madj_Log2', 'hs_oxominp_madj_Log2')
diet_final <- c('e3_alcpreg_yn_None', 'h_cereal_preg_Ter', 'h_dairy_preg_Ter', 'h_fastfood_preg_Ter', 'h_folic_t1_None', 'h_fruit_preg_Ter', 'h_meat_preg_Ter', 'h_pamod_t3_None')

# Model D: Covariates + Diet + Phthalates + Metabolites
metabolites_final <- grep("^meta", names(full_data_with_meta_imputed), value = TRUE)
predictors_model4 <- c(covariates_final, phthalates_final, diet_final, metabolites_final)
X_model4 <- full_data_with_meta_imputed[, predictors_model4]
y_model4 <- full_data_with_meta_imputed$hs_bmi_c_cat


y_model4 <- factor(y_model4)

# cross-validation setup
set.seed(2024)
folds <- createFolds(y_model4, k = 10, list = TRUE, returnTrain = TRUE)


auc_values <- vector()
variable_importance <- vector("list", length = 10)

# cross-validation =19
for (i in 1:10) {
  train_indices <- folds[[i]]
  test_indices <- setdiff(seq_len(nrow(X_model4)), train_indices)
  
  X_train <- X_model4[train_indices, ]
  y_train <- y_model4[train_indices]
  X_test <- X_model4[test_indices, ]
  y_test <- y_model4[test_indices]
  
  # Train the RF model
  rf_model <- randomForest(x = X_train, y = y_train, importance = TRUE, ntree = 500)
  
  # Predictions on the test set
  rf_predictions <- predict(rf_model, X_test, type = "prob")[, 2]
  
  # AUC
  roc_rf <- roc(as.numeric(y_test), rf_predictions)
  auc_rf <- auc(roc_rf)
  auc_values <- c(auc_values, auc_rf)
  

  variable_importance[[i]] <- importance(rf_model)
}

# Average AUC
mean_auc <- mean(auc_values)
print(paste("Mean AUC for Random Forest Model D:", mean_auc))
## [1] "Mean AUC for Random Forest Model D: 0.742264157467247"
# Average variable importance
mean_importance <- Reduce("+", variable_importance) / length(variable_importance)
importance_df <- data.frame(Variable = rownames(mean_importance), MeanDecreaseGini = mean_importance[, 'MeanDecreaseGini'])

# top 10 variables
top10_importance_df <- importance_df %>%
  arrange(desc(MeanDecreaseGini)) %>%
  top_n(10, MeanDecreaseGini)

# Var Importance top 10 variables
ggplot(top10_importance_df, aes(x = reorder(Variable, MeanDecreaseGini), y = MeanDecreaseGini)) +
  geom_bar(stat = "identity") +
  coord_flip() +
  labs(title = "Top 10 Variable Importance - Random Forest Model D", x = "Variable", y = "Mean Decrease in Gini")

# Plot ROC curve for the last fold
roc_data_rf <- data.frame(
  specificity = 1 - roc_rf$specificities,
  sensitivity = roc_rf$sensitivities,
  Model = "Random Forest Model D"
)

# ROC curve with ggplot2
ggplot(roc_data_rf, aes(x = specificity, y = sensitivity, color = Model)) +
  geom_line(size = 1) +
  geom_abline(linetype = "dashed", color = "grey") +
  labs(title = "ROC Curve for Random Forest Model D",
       x = "1 - Specificity",
       y = "Sensitivity") +
  theme_minimal() +
  theme(legend.position = "right") +
  scale_color_manual(values = c("Random Forest Model D" = "darkblue"))

Model 3B: Gradient Boosting Machine (GBM)

AUC

full_data_with_meta_imputed$hs_bmi_c_cat <- as.numeric(as.factor(full_data_with_meta_imputed$hs_bmi_c_cat)) - 1
full_data_with_meta_imputed$e3_sex_None <- as.factor(full_data_with_meta_imputed$e3_sex_None)


full_data_with_meta_imputed <- full_data_with_meta_imputed %>%
  mutate(across(everything(), ~ ifelse(is.na(.), mean(., na.rm = TRUE), .)))

#Redefine
covariates_final <- c('h_cohort', 'e3_sex_None', 'hs_child_age_None')
phthalates_final <- c('hs_mbzp_madj_Log2', 'hs_mecpp_madj_Log2', 'hs_mep_madj_Log2', 'hs_mibp_madj_Log2', 'hs_mnbp_madj_Log2', 'hs_oxominp_madj_Log2')
diet_final <- c('e3_alcpreg_yn_None', 'h_cereal_preg_Ter', 'h_dairy_preg_Ter', 'h_fastfood_preg_Ter', 'h_folic_t1_None', 'h_fruit_preg_Ter', 'h_meat_preg_Ter', 'h_pamod_t3_None')

# Model D: Covariates + Diet + Phthalates + Metabolites
metabolites_final <- grep("^meta", names(full_data_with_meta_imputed), value = TRUE)
predictors_model4 <- c(covariates_final, phthalates_final, diet_final, metabolites_final)
X_model4 <- full_data_with_meta_imputed[, predictors_model4]
y_model4 <- full_data_with_meta_imputed$hs_bmi_c_cat

# cross-validation setup
set.seed(2024)
folds <- createFolds(y_model4, k = 10, list = TRUE)


auc_values <- vector()
variable_importance <- vector("list", length = 10)

#  cross-validation = 10
for (i in 1:10) {
  train_indices <- folds[[i]]
  test_indices <- setdiff(seq_len(nrow(X_model4)), train_indices)
  
  X_train <- X_model4[train_indices, ]
  y_train <- y_model4[train_indices]
  X_test <- X_model4[test_indices, ]
  y_test <- y_model4[test_indices]
  
  # Train the GBM model
  gbm_model <- gbm(
    formula = y_train ~ .,
    data = data.frame(X_train, y_train),
    distribution = "bernoulli",
    n.trees = 500,
    interaction.depth = 3,
    shrinkage = 0.01,
    cv.folds = 10,
    n.minobsinnode = 10,
    verbose = FALSE
  )
  
  # Best # of trres 
  best_trees <- gbm.perf(gbm_model, method = "cv")
  
  # Predictions on the test set
  gbm_predictions <- predict(gbm_model, newdata = X_test, n.trees = best_trees, type = "response")
  
  # Calculate AUC
  roc_gbm <- roc(y_test, gbm_predictions)
  auc_gbm <- auc(roc_gbm)
  auc_values <- c(auc_values, auc_gbm)
  

  variable_importance[[i]] <- summary(gbm_model, n.trees = best_trees, plotit = FALSE)
}
## Setting levels: control = 0, case = 1
## Setting direction: controls < cases

## Setting levels: control = 0, case = 1
## Setting direction: controls < cases

## Setting levels: control = 0, case = 1
## Setting direction: controls < cases

## Setting levels: control = 0, case = 1
## Setting direction: controls < cases

## Setting levels: control = 0, case = 1
## Setting direction: controls < cases

## Setting levels: control = 0, case = 1
## Setting direction: controls < cases

## Setting levels: control = 0, case = 1
## Setting direction: controls < cases

## Setting levels: control = 0, case = 1
## Setting direction: controls < cases

## Setting levels: control = 0, case = 1
## Setting direction: controls < cases

## Setting levels: control = 0, case = 1
## Setting direction: controls < cases

# Average AUC
mean_auc <- mean(auc_values)
print(paste("Mean AUC for GBM Model D:", mean_auc))
## [1] "Mean AUC for GBM Model D: 0.65591221276392"
# Average variable importance
mean_importance <- do.call(rbind, variable_importance) %>%
  group_by(var) %>%
  summarize(rel.inf = mean(rel.inf)) %>%
  arrange(desc(rel.inf))

# Select top 20 variables
top20_importance_gbm <- mean_importance %>%
  top_n(20, rel.inf)

# VarImportance for top 10 variables
ggplot(top20_importance_gbm, aes(x = reorder(var, rel.inf), y = rel.inf)) +
  geom_bar(stat = "identity") +
  coord_flip() +
  labs(title = "Top 20 Variable Importance - GBM Model D", x = "Variable", y = "Relative Importance")

#ROC Curve 
roc_data_gbm <- data.frame(
  specificity = 1 - roc_gbm$specificities,
  sensitivity = roc_gbm$sensitivities,
  Model = "GBM Model D"
)

#ROC curve with ggplot2
ggplot(roc_data_gbm, aes(x = specificity, y = sensitivity, color = Model)) +
  geom_line(size = 1) +
  geom_abline(linetype = "dashed", color = "grey") +
  labs(title = "ROC Curve for GBM Model D",
       x = "1 - Specificity",
       y = "Sensitivity") +
  theme_minimal() +
  theme(legend.position = "right") +
  scale_color_manual(values = c("GBM Model D" = "darkblue"))

Conclusion: Comparison of All Models

auc_data <- data.frame(
  Model = c("GBM Model D", "Random Forest Model D", "Group LASSO Model D"),
  Mean_AUC = c(0.64276948435256, 0.735747185689977, 0.924780193565975)
)

# Reordering the factor levels 
auc_data$Model <- factor(auc_data$Model, levels = c("Group LASSO Model D", "Random Forest Model D", "GBM Model D"))


ggplot(auc_data, aes(x = Model, y = Mean_AUC, fill = Model)) +
  geom_bar(stat = "identity") +
  geom_text(aes(label = round(Mean_AUC, 3)), vjust = -0.2) +
  labs(title = "AUC Values for Different Models", x = "Model", y = "Mean AUC") +
  theme_minimal() +
  theme(axis.text.x = element_text(angle = 45, hjust = 1)) +
  scale_fill_brewer(palette = "Set2") 

Reference: https://stat.ethz.ch/Manuscripts/buhlmann/logistic-grouplasso-final.pdf https://cran.r-project.org/web/packages/grplasso/grplasso.pdf